After talking to Shweta and Rohit once I showed them the results from the last script (03_clustering), it was difficult to say which methods performs clustering better. To get a better idea of what each cluster is at a cell type level, I did a rough cell type annotation ans subsetted the PVM compartment for both methods. In order to streamline things, I kept the resolution for both SCTransform and Seurat VST full objects as 0.2.
Side note about finding markers and any downstream analysis using SCTransform:
There was some confusion regarding performing DE analysis after SCTransform. In this context, I found this (https://github.com/satijalab/seurat/issues/2180 and https://github.com/satijalab/seurat/issues/2115):
“The scale.data (in SCTransform slot) is the pearson residuals that come out of regularized NB regression. The counts and data slot transforms these values back into integer values (stored in counts), and then performs a log-transformation (stored in data).
SCTransform (by default) only stores pearson residuals (scale.data) for 3,000 variable features, to save memory. When possible, we try to perform operations directly on the Pearson residuals themselves. However, these values are not sparse (contain exclusively non-zero elements), so they take a lot of memory to store, and as a result we don’t compute them for all genes by default. Unless you change the defaults in SCTransform, performing DE on the scale.data slot would only test differences in variable genes. We also find that pearson residuals are challenging to visualize/interpret on either Feature or Violin plots, and therefore find the data slot in the SCT assay quite useful for this.
So performing DE on the scale.data slot of this assay means you are only testing 3,000 genes. Performing DE on the RNA assay will test all genes.”
Therefore, I have changed the default assay to RNA, and normalized and scaled the data for markers analysis of the SCTransform object in this document.
Code
DefaultAssay(seurat_sct) <-"RNA"seurat_sct <-NormalizeData(seurat_sct)seurat_sct <-ScaleData(seurat_sct)degs_sct <-FindAllMarkers(seurat_sct, only.pos =TRUE, assay ="RNA",min.pct =0.25, logfc.threshold =0.1)# Extract top 20 genes for each clustertop_genes <- degs_sct %>%group_by(cluster) %>%arrange(desc(avg_log2FC)) %>%slice_head(n =20)# Save to CSV filewrite.csv(top_genes, paste0(out_data_dir,tag,"_markers_sct_res0.2.csv"), row.names =FALSE)#VSTdegs_vst <-FindAllMarkers(seurat_vst, only.pos =TRUE, assay ="RNA",min.pct =0.25, logfc.threshold =0.1)# Extract top 20 genes for each clustertop_genes <- degs_vst %>%group_by(cluster) %>%arrange(desc(avg_log2FC)) %>%slice_head(n =20)# Save to CSV filewrite.csv(top_genes, paste0(out_data_dir,tag,"_markers_vst_res0.2.csv"), row.names =FALSE)save(seurat_sct, file="./output/processed/03.2_SI_seurat_sct.Robj")
These files are also attached in the shared Google folder.
Selection
In the meeting on 03.03.25, we decided to go with SCT since it seems that SCTransform better captures the biological variability between different samples/time_points. To be conservative, we will keep resolution at 0.2 which gives us 13 clusters.
Validation
To validate the clusters we will repeat some of our quality control plots separated by cluster. At this stage we just want to check that none of the clusters are obviously the result of technical factors.
p1 <-VlnPlot(seurat_sct, features ="log10genes_per_UMI") +NoLegend()p2 <-FeaturePlot(seurat_sct, features ="log10genes_per_UMI", label =TRUE) +NoLegend()p1 + p2 +plot_layout(ncol =2)
Summary
After clustering, we have a dataset normalized and scaled using SCTransform, and identified 13 clusters.
Parameters
This table describes parameters used and set in this document.
Code
library(DT)res <-0.2resolutions <-seq(0, 1, 0.1)params <-list(list(Parameter ="Number of genes selected by the VST method",Value ="3000" ),list(Parameter ="Number of genes selected by the SCT method",Value ="3000" ),list(Parameter ="Number of principal components for VST object clustering",Value =18 ),list(Parameter ="Number of principal components for SCT object clustering",Value =13 ),list(Parameter ="Number of neighbours for nearest neighbor graph",Value =30 ),list(Parameter ="Range of possible clustering resolutions",Value ="0-0.9" ),list(Parameter ="Method chosen for downstream analysis",Value ="SCTransform" ),list(Parameter ="Selected resolution parameter for clustering",Value = res ),list(Parameter ="Number of clusters produced by selected resolution",Value ="13" ))# Convert each sublist to a data frame and bind them togetherparams_df <-do.call(rbind, lapply(params, function(x) {data.frame(Parameter = x$Parameter, Value = x$Value, stringsAsFactors =FALSE)}))datatable( params_df,options =list(pageLength =10),rownames =TRUE,)